file format
DISPLIB: a library of train dispatching problems
Kloster, Oddvar, Luteberget, Bjørnar, Mannino, Carlo, Sartor, Giorgio
Optimization-based decision support systems have a significant potential to reduce delays, and thus improve efficiency on the railways, by automatically re-routing and re-scheduling trains after delays have occurred. The operations research community has dedicated a lot of effort to developing optimization algorithms for this problem, but each study is typically tightly connected with a specific industrial use case. Code and data are seldom shared publicly. This fact hinders reproducibility, and has led to a proliferation of papers describing algorithms for more or less compatible problem definitions, without any real opportunity for readers to assess their relative performance. Inspired by the successful communities around MILP, SAT, TSP, VRP, etc., we introduce a common problem definition and file format, DISPLIB, which captures all the main features of train re-routing and re-scheduling. We have gathered problem instances from multiple real-world use cases and made them openly available. In this paper, we describe the problem definition, the industrial instances, and a reference solver implementation. This allows any researcher or developer to work on the train dispatching problem without an industrial connection, and enables the research community to perform empirical comparisons between solvers.
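As a starting point for working with such instances, here is a minimal Python sketch, assuming the JSON encoding used for the published instances; the field names ("trains", "objective") are illustrative, and the DISPLIB specification remains the authoritative schema.

```python
import json

# Minimal sketch of exploring a DISPLIB instance, assuming a JSON
# encoding; the keys "trains" and "objective" are illustrative and
# should be checked against the DISPLIB specification.
with open("instance.json") as f:
    instance = json.load(f)

print("objective terms:", len(instance.get("objective", [])))
for i, train in enumerate(instance["trains"]):
    # Each train is assumed to be a list of operations (route segments
    # with durations and resource claims).
    print(f"train {i}: {len(train)} operations")
```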
- Europe > Norway > Eastern Norway > Oslo (0.04)
- Europe > Italy (0.04)
- South America > Brazil > Pernambuco > Recife (0.04)
- Asia > Middle East > Saudi Arabia > Asir Province > Abha (0.04)
- Transportation > Infrastructure & Services (1.00)
- Transportation > Ground > Rail (1.00)
ProdRev: A DNN framework for empowering customers using generative pre-trained transformers
Following the pandemic, customers' preference for using e-commerce has accelerated. Since much information is available across multiple reviews (sometimes running into the thousands) for a single product, it can create decision paralysis for the buyer. This scenario disempowers the consumer, who cannot be expected to read so many reviews, since doing so is time-consuming and can confuse them. Various commercial tools are available that use a scoring mechanism to arrive at an adjusted score, which can alert the user to potential review manipulations. This paper proposes a framework that fine-tunes a generative pre-trained transformer to understand these reviews better and, furthermore, uses "common sense" to make better decisions. These models have more than 13 billion parameters. To fine-tune the model for our requirement, we use the curie engine from the generative pre-trained transformer (GPT-3). By using generative models, we introduce abstractive summarization instead of a simple extractive method of summarizing the reviews. This brings out the true relationship between the reviews rather than simply copying and pasting them, introduces an element of "common sense" for the user, and helps them quickly make the right decisions. The user is provided the pros and cons of the processed reviews, so the user/customer can make their own decisions.
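As a minimal sketch of what such fine-tuning input could look like, the snippet below writes training examples in the prompt/completion JSONL layout used by the legacy GPT-3 fine-tuning API; the review text and summary are invented, and the separator/stop conventions follow OpenAI's old fine-tuning guidance.

```python
import json

# Sketch: prepare fine-tuning data for the legacy GPT-3 (curie) API in
# prompt/completion JSONL form. The example record is invented.
examples = [
    {
        "reviews": "Battery lasts two days. Screen scratches easily.",
        "summary": "Pros: long battery life. Cons: fragile screen.",
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        record = {
            # "###" separator and " END" stop token follow OpenAI's old
            # fine-tuning guidance.
            "prompt": f"Summarize the pros and cons:\n{ex['reviews']}\n\n###\n\n",
            "completion": " " + ex["summary"] + " END",
        }
        f.write(json.dumps(record) + "\n")
```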
Multimodal Techniques for Malware Classification
The threat of malware is a serious concern for computer networks and systems, highlighting the need for accurate classification techniques. In this research, we experiment with multimodal machine learning approaches for malware classification, based on the structured nature of the Windows Portable Executable (PE) file format. Specifically, we train Support Vector Machine (SVM), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN) models on features extracted from PE headers, on features extracted from the other sections of PE files, and on features extracted from the entire PE file. We then train SVM models on each of the nine header-section combinations of these baseline models, using the output-layer probabilities of the component models as feature vectors. We compare the baseline cases to these multimodal combinations. In our experiments, we find that the best of the multimodal models outperforms the best of the baseline cases, indicating that it can be advantageous to train separate models on distinct parts of Windows PE files.
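The combination step is a plain probability-stacking scheme. A minimal scikit-learn sketch, using SVMs for both the component and meta models (the paper's LSTM/CNN components are omitted), could look like this:

```python
import numpy as np
from sklearn.svm import SVC

# Sketch of probability stacking over two feature views (header vs.
# section features). X_header, X_section, y and the *_te test-set
# counterparts are assumed to be precomputed feature matrices/labels.
def stacked_predict(X_header, X_section, y, X_header_te, X_section_te):
    svm_h = SVC(probability=True).fit(X_header, y)
    svm_s = SVC(probability=True).fit(X_section, y)

    # The component models' class-probability outputs become the
    # feature vectors of the meta-level SVM. In practice, use
    # cross-validated probabilities (e.g. cross_val_predict) on the
    # training split to avoid leakage.
    meta_tr = np.hstack([svm_h.predict_proba(X_header),
                         svm_s.predict_proba(X_section)])
    meta_te = np.hstack([svm_h.predict_proba(X_header_te),
                         svm_s.predict_proba(X_section_te)])

    meta = SVC().fit(meta_tr, y)
    return meta.predict(meta_te)
```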
RadField3D: A Data Generator and Data Format for Deep Learning in Radiation-Protection Dosimetry for Medical Applications
Lehner, Felix, Lombardo, Pasquale, Castillo, Susana, Hupe, Oliver, Magnor, Marcus
In this research work, we present our open-source Geant4-based Monte Carlo simulation application, called RadField3D, for generating three-dimensional radiation field datasets for dosimetry. Alongside it, we introduce a fast, machine-interpretable data format with a Python API, called RadFiled3D, for easy integration into neural network research. Both developments are intended to support research into alternative radiation simulation methods using deep learning. All data used for our validation (measured and simulated), along with our source code, are published in separate repositories.
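The RadFiled3D API itself is documented in its repository; the snippet below is a purely hypothetical illustration (not RadFiled3D's actual interface) of how a machine-interpretable voxel-field container might be consumed from neural-network code.

```python
import numpy as np

# Hypothetical illustration only; this is NOT the RadFiled3D API.
# A 3D radiation field can be viewed as a dense voxel grid of dose
# values plus acquisition metadata, mimicked here with a .npz container
# and invented keys.
def load_field(path):
    data = np.load(path)
    dose = data["dose"]                    # shape (nx, ny, nz), invented key
    voxel_size_mm = data["voxel_size_mm"]  # invented key
    return dose, voxel_size_mm
```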
- North America > United States > New Mexico (0.04)
- Europe > Belgium (0.04)
- North America > Canada (0.04)
- Europe > Germany > Lower Saxony > Hanover (0.04)
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Magika: AI-Powered Content-Type Detection
Fratantonio, Yanick, Invernizzi, Luca, Farah, Loua, Thomas, Kurt, Zhang, Marina, Albertini, Ange, Galilee, Francois, Metitieri, Giancarlo, Cretin, Julien, Petit-Bianco, Alex, Tao, David, Bursztein, Elie
The task of content-type detection -- which entails identifying the data encoded in an arbitrary byte sequence -- is critical for operating systems, development, reverse engineering environments, and a variety of security applications. In this paper, we introduce Magika, a novel AI-powered content-type detection tool. Under the hood, Magika employs a deep learning model that can execute on a single CPU with just 1MB of memory to store the model's weights. We show that Magika achieves an average F1 score of 99% across over a hundred content types and a test set of more than 1M files, outperforming all existing content-type detection tools today. In order to foster adoption and improvements, we open source Magika under an Apache 2 license on GitHub and make our model and training pipeline publicly available. Our tool has already seen adoption by the Gmail email provider for attachment scanning, and it has been integrated with VirusTotal to aid with malware analysis. We note that this paper discusses the first iteration of Magika, and a more recent version already supports more than 200 content types. The interested reader can see the latest development on the Magika GitHub repository, available at https://github.com/google/magika.
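For reference, the Python API advertised in the project README for this first iteration looks roughly as follows; later releases may change the interface slightly.

```python
from magika import Magika

# Identify the content type of a byte buffer with Magika's deep-learning
# model; usage mirrors the project README at the time of writing.
m = Magika()
res = m.identify_bytes(b"# Example\nThis is some *markdown* text.")
print(res.output.ct_label)  # e.g. "markdown"
```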
3D Data Long-Term Preservation in Cultural Heritage
Amico, Nicola, Felicetti, Achille
In digital heritage, effective management and preservation of digital data are crucial. Issues such as file corruption, media obsolescence, and inadequate metadata must be addressed, alongside data migration when software becomes outdated and thorough data curation to aid current and future researchers in searching, citing, and reusing historical data. Merely archiving or backing up project data is not enough for long-term preservation. It is essential to ensure that primary data remain reusable, compatible with evolving operating systems, and accompanied by comprehensive metadata detailing their creation and history [1]. Despite the advantage of heritage datasets being "born digital," they are still susceptible to loss if file associations and metadata are not properly maintained. The large volume of data generated by digital projects, and the often limited understanding of file associations among project members, jeopardise the future reuse of archaeological data if it is not well organised and curated. Enhancing workflows to include both metadata authorship and preservation is vital to prevent information loss and digital data obsolescence. In particular, the long-term preservation of 3D datasets requires maintaining each file in a usable and uncorrupted state: files undergo several modifications and format changes during the creation of the final scan or 3D model, known as an asset.
- North America > Canada > British Columbia > East Kootenay Region > Fernie (0.04)
- Europe > Middle East > Cyprus (0.04)
- North America > United States > Michigan (0.04)
- (8 more...)
- Research Report (1.00)
- Workflow (0.88)
- Education (1.00)
- Information Technology > Security & Privacy (0.92)
- Law (0.68)
On the Abuse and Detection of Polyglot Files
Koch, Luke, Oesch, Sean, Chaulagain, Amul, Dixon, Jared, Dixon, Matthew, Huettal, Mike, Sadovnik, Amir, Watson, Cory, Weber, Brian, Hartman, Jacob, Patulski, Richard
A polyglot is a file that is valid in two or more formats. Polyglot files pose a problem for malware detection systems that route files to format-specific detectors/signatures, as well as file upload and sanitization tools. In this work we found that existing file-format and embedded-file detection tools, even those developed specifically for polyglot files, fail to reliably detect polyglot files used in the wild, leaving organizations vulnerable to attack. To address this issue, we studied the use of polyglot files by malicious actors in the wild, finding 30 polyglot samples and 15 attack chains that leveraged polyglot files. In this report, we highlight two well-known APTs whose cyber attack chains relied on polyglot files to bypass detection mechanisms. Using knowledge from our survey of polyglot usage in the wild -- the first of its kind -- we created a novel data set based on adversary techniques. We then trained a machine learning detection solution, PolyConv, using this data set. PolyConv achieves a precision-recall area-under-curve score of 0.999 with an F1 score of 99.20% for polyglot detection and 99.47% for file-format identification, significantly outperforming all other tools tested. We developed a content disarmament and reconstruction tool, ImSan, that successfully sanitized 100% of the tested image-based polyglots, which were the most common type found via the survey. Our work provides concrete tools and suggestions to enable defenders to better defend themselves against polyglot files, as well as directions for future work to create more robust file specifications and methods of disarmament.
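For intuition, the sketch below shows the naive signature-scanning baseline (not the paper's PolyConv model): flag a file when the magic bytes of more than one format appear in it. Real-world polyglots are far more varied than this heuristic can capture, which is what motivates a learned detector.

```python
# Naive polyglot red-flag check: scan one file for the magic bytes of
# several formats. A GIF that also contains a ZIP local-file header is
# the classic "GIFAR" example.
MAGICS = {
    b"GIF87a": "gif", b"GIF89a": "gif",
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip/jar",
}

def formats_present(path):
    data = open(path, "rb").read()
    return {name for magic, name in MAGICS.items() if magic in data}

# More than one format present suggests a possible polyglot.
if len(formats_present("suspect.gif")) > 1:
    print("possible polyglot")
```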
- North America > United States > Tennessee > Anderson County > Oak Ridge (0.05)
- Asia > South Korea (0.04)
- North America > United States > New York (0.04)
- (4 more...)
- Research Report (0.82)
- Overview (0.68)
I/O in Machine Learning Applications on HPC Systems: A 360-degree Survey
Lewis, Noah, Bez, Jean Luca, Byna, Suren
Because of the increased popularity of Machine Learning (ML) workloads, there is a rising demand for I/O systems that can effectively accommodate their distinct I/O access patterns. Write operation bursts commonly dominate traditional workloads; however, ML workloads are usually read-intensive and use many small files [99]. Due to the absence of a well-established consensus on the preferred I/O stack for ML workloads, numerous developers resort to crafting their own ad-hoc algorithms and storage systems to cater to the specific requirements of their applications [50]. This can result in sub-optimal application performance due to under-utilization of the storage system, prompting the need for novel I/O optimization methods tailored to the demands of ML workloads. In Figure 1, we show the evolving I/O stack used for running ML workloads (on the right side) in comparison with the traditional HPC I/O stack (on the left side). The traditional HPC I/O stack has been developed to support massive parallelism.
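The access-pattern contrast is easy to picture in code. The sketch below, with a hypothetical dataset_dir, mimics the read-intensive, many-small-files pattern of ML training epochs that the survey contrasts with HPC's large sequential write bursts.

```python
import os
import random

# Sketch of the ML-training access pattern: every epoch reads all
# sample files in a random order, producing many small, independent
# reads (unlike large sequential checkpoint writes in traditional HPC).
# "dataset_dir" is a hypothetical path.
files = [os.path.join("dataset_dir", f) for f in os.listdir("dataset_dir")]

for epoch in range(3):
    random.shuffle(files)      # random order defeats simple readahead
    for path in files:
        with open(path, "rb") as f:
            sample = f.read()  # one small read per training sample
```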
- North America > United States > New York > New York County > New York City (0.14)
- North America > United States > Utah > Salt Lake County > Salt Lake City (0.04)
- North America > United States > Washington > King County > Renton (0.04)
- (15 more...)
- Energy (0.93)
- Information Technology > Services (0.67)
- Government > Regional Government > North America Government > United States Government (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.66)
Advanced Knowledge Extraction of Physical Design Drawings, Translation and conversion to CAD formats using Deep Learning
M, Jesher Joshua, V, Ragav, P, Syed Ibrahim S
The maintenance, archiving, and usage of design drawings in physical form over long periods is cumbersome across industries, and it is hard to extract information by simply scanning drawing sheets. Converting them to digital formats such as Computer-Aided Design (CAD), with the needed knowledge extraction, can solve this problem. The conversion of these machine drawings to digital form is a crucial challenge which requires advanced techniques. This research proposes an innovative methodology utilizing Deep Learning methods. The approach employs object detection models such as YOLOv7 and Faster R-CNN to detect the physical drawing objects present in the images, followed by edge detection algorithms such as the Canny filter to extract and refine the identified lines from the drawing region, and curve detection techniques to detect circles. Ornaments (complex shapes) within the drawings are also extracted. To ensure comprehensive conversion, an Optical Character Recognition (OCR) tool is integrated to identify and extract the text elements from the drawings. The extracted data, which includes the lines, shapes, and text, is consolidated and stored in a structured comma-separated values (.csv) file format. The accuracy and efficiency of the conversion are evaluated. Through this, conversion can be automated to help organizations enhance their productivity, facilitate seamless collaboration, and preserve valuable design information in an easily accessible digital format. Overall, this study contributes to the advancement of CAD conversion, providing accurate results from the translation process. Future research can focus on handling diverse drawing types and enhancing accuracy in shape and line detection and extraction.
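As a rough illustration of the middle stages of this pipeline, the OpenCV sketch below runs Canny edge detection and Hough line/circle detection, then writes the results to a .csv file; the object-detection (YOLOv7 / Faster R-CNN) and OCR stages are omitted, and all thresholds are placeholder values.

```python
import csv

import cv2
import numpy as np

# Sketch of the line/circle extraction and CSV stages with OpenCV.
img = cv2.imread("drawing.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)

lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                        minLineLength=30, maxLineGap=5)
circles = cv2.HoughCircles(img, cv2.HOUGH_GRADIENT, dp=1, minDist=20,
                           param1=150, param2=40)

with open("drawing.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["kind", "x1", "y1", "x2", "y2", "r"])
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            w.writerow(["line", x1, y1, x2, y2, ""])
    if circles is not None:
        for x, y, r in np.round(circles[0]).astype(int):
            w.writerow(["circle", x, y, "", "", r])
```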
Poker Hand History File Format Specification
This paper introduces the Poker Hand History (PHH) file format, designed to standardize the recording of poker hands across different game variants. Despite poker's widespread popularity in mainstream culture as a mind sport and its prominence in artificial intelligence (AI) research as a benchmark for imperfect-information AI agents, it lacks a consistent format for documenting poker hands across variants that humans can read and machines can easily parse. To address this gap in the literature, we propose the PHH format, which provides a concise, human-readable, machine-friendly representation of hand history that comprehensively captures various details of the hand, ranging from initial game parameters and actions to contextual parameters including, but not limited to, the venue, players, and time control information. In the supplementary material, we provide over 10,000 hands covering 11 different variants in the PHH format. Building on our previous work on PokerKit, a premier poker hand simulation tool, we demonstrate the usage of our open-source Python implementation of the PHH parser. The source code of the parser is available on GitHub: https://github.com/uoftcprg/pokerkit
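Because PHH is designed to be both human-readable and machine-friendly, a hand can be loaded with a generic TOML reader. The sketch below assumes PHH's TOML-compatible syntax with abridged, illustrative field names and an invented hand; PokerKit's parser (linked above) remains the reference implementation.

```python
import tomllib  # Python 3.11+

# A minimal invented hand in PHH-style TOML; field names are abridged
# for illustration and should be checked against the PHH specification.
PHH = """
variant = "NT"
antes = [0, 0]
blinds_or_straddles = [1, 2]
min_bet = 2
starting_stacks = [200, 200]
actions = ["d dh p1 AcKs", "d dh p2 ????", "p2 cbr 6", "p1 f"]
"""

hand = tomllib.loads(PHH)
print(hand["variant"], "-", len(hand["actions"]), "actions")
```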
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > Nevada > Clark County > Las Vegas (0.05)
- North America > United States > Texas > Kleberg County (0.04)
- (12 more...)